<!--begin.rcode results='hide', echo=FALSE, message=FALSE
library(caret)
data(BloodBrain)

hook_inline = knit_hooks$get('inline')
knit_hooks$set(inline = function(x) {
  if (is.character(x)) highr::hi_html(x) else hook_inline(x)
  })
opts_chunk$set(comment=NA)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
    end.rcode-->

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
    Design by Free CSS Templates
    http://www.freecsstemplates.org
    Released for free under a Creative Commons Attribution 2.5 License

    Name       : Emerald 
    Description: A two-column, fixed-width design with dark color scheme.
    Version    : 1.0
    Released   : 20120902

  -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="keywords" content="" />
    <meta name="description" content="" />
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <title>Data Sets</title>
    <link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css' />
    <link href="style.css" rel="stylesheet" type="text/css" media="screen" />
  </head>
  <body>
    <div id="wrapper">
      <div id="header-wrapper" class="container">
	<div id="header" class="container">
	  <div id="logo">
	    <h1><a href="#">Data Sets</a></h1>
	  </div>
          <!--
	      <div id="menu">
		<ul>
		  <li class="current_page_item"><a href="#">Homepage</a></li>
		  <li><a href="#">Blog</a></li>
		  <li><a href="#">Photos</a></li>
		  <li><a href="#">About</a></li>
		  <li><a href="#">Contact</a></li>
		</ul>
	      </div>
              -->
	</div>
	<div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
      </div>
      <!-- end #header -->
      <div id="page">
	<div id="content">
          <p>
            There are a few data sets included in <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a>. The first four are computational chemistry problems where the objective is to relate the molecular structure of compounds (via molecular descriptors) to some property of interest (<a href="http://www.sciencedirect.com/science/article/pii/S1359644699014518">Clark and Pickett (2000)</a>). Similar data sets can be found in the <a href="http://cran.r-project.org/web/packages/QSARdata/index.html"><strong>QSARdata</strong></a> R package.
          </p>
          <p>
          Other R packages with data are: <a href="http://cran.r-project.org/web/packages/mlbench/index.html"><strong>mlbench</strong></a>, <a href="http://cran.r-project.org/web/packages/SMCRM/index.html"><strong>SMCRM</strong></a> and <a href="http://cran.r-project.org/web/packages/AppliedPredictiveModeling/index.html"><strong>AppliedPredictiveModeling</strong></a>.
          </p>
          <h2>Blood-Brain Barrier Data</h2>
          <p>
            <a href="http://www.springerlink.com/content/72j377175n536768/?p=f546488cc8fa4ec7a3d4911eb20adb3c&pi=0">Mente and Lombardo (2005)</a> developed models to predict the log of the ratio of the concentration of a compound in the brain to its concentration in blood. For each compound, they computed three sets of molecular descriptors: MOE 2D, rule-of-five and Charge Polar Surface Area (CPSA). In all, <!--rinline I(dim(bbbDescr)[2]) --> descriptors were calculated. Included in this package are <!--rinline I(dim(bbbDescr)[1]) --> non-proprietary literature compounds. The vector <code>logBBB</code> contains the log concentration ratio and the data frame <code>bbbDescr</code> contains the descriptor values.
          </p>
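          <p>
            As a minimal sketch of how these two objects might be inspected after loading (the particular summaries chosen here are illustrative and not part of the original analysis):
          </p>
<!--begin.rcode bbbCheck
library(caret)
data(BloodBrain)
## Rows are compounds, columns are molecular descriptors
dim(bbbDescr)
## There should be one log concentration ratio per compound
length(logBBB)
summary(logBBB)
    end.rcode-->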
          <h2>COX-2 Activity Data</h2>
          <p>
            From <a href="http://pubs.acs.org/cgi-bin/abstract.cgi/jmcmar/2004/47/i22/abs/jm0497141.html">Sutherland, O'Brien, and Weaver (2003)</a>: A set of 467 cyclooxygenase-2 (COX-2) inhibitors has been assembled from the published work of a single research group, with in vitro activities against human recombinant enzyme expressed as IC50 values ranging from 1 nM to &gt;100 uM (53 compounds have indeterminate IC50 values).
          </p>
          <p>
	    A set of 255 descriptors (MOE2D and QikProp) was generated. To classify the data, we used a cutoff of 2<sup>2.5</sup> on the IC50 values to determine activity.
	  </p>
	  <p>
	    Using <tt><!--rinline 'data(cox2)' --></tt> exposes three R objects: <code>cox2Descr</code> is a data frame with the descriptor data, <code>cox2IC50</code> is a numeric vector of IC50 assay values and <code>cox2Class</code> is a factor vector with the activity results.
	  </p>
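	  <p>
	    A short sketch of how the three objects relate to one another is shown below. The cross-tabulation compares the class labels with the stated cutoff; the name <code>belowCutoff</code> is only an illustrative label, and the direction of the comparison is an assumption to be checked against the printed table.
	  </p>
<!--begin.rcode cox2Check
library(caret)
data(cox2)
## Compounds in rows, descriptors in columns
dim(cox2Descr)
## Frequency of each activity class
table(cox2Class)
## Compare the class labels with the 2^2.5 cutoff on the IC50 values
table(cox2Class, belowCutoff = cox2IC50 <= 2^2.5)
    end.rcode-->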
	  <h2>DHFR Inhibition</h2>
          <p>
          <a href="http://www.springerlink.com/content/q5m5xp1q356p2071/">Sutherland and Weaver (2004)</a> discuss QSAR models for dihydrofolate reductase (DHFR) inhibition. This data set contains values for 325 compounds. For each compound, 228 molecular descriptors have been calculated. Additionally, each sample is designated as "active" or "inactive".
</p>
<p>
The data frame <code>dhfr</code> contains a column called <code>Y</code> with the outcome classification. The remainder of the columns are molecular descriptor values.
          </p>
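          <p>
          One way to separate the outcome from the descriptors is sketched below; the object name <code>dhfrDescr</code> is a hypothetical choice made for this example.
          </p>
<!--begin.rcode dhfrCheck
library(caret)
data(dhfr)
## Class frequencies for the outcome column
table(dhfr$Y)
## Keep every column except the outcome
dhfrDescr <- dhfr[, names(dhfr) != "Y"]
dim(dhfrDescr)
    end.rcode-->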
          <h2>Tecator NIR Data</h2>
<p>
These data can be found in the datasets section of <a href="http://lib.stat.cmu.edu/datasets/tecator">StatLib</a>. The data consist of 215 samples of chopped meat, each with a 100-channel near infrared absorbance spectrum, used to predict the moisture, fat and protein values.
</p><p>
From  <a href="http://lib.stat.cmu.edu/datasets/tecator">StatLib</a>:
</p>
<blockquote> These data are recorded on a Tecator Infratec Food and Feed Analyzer 
working in the wavelength range 850 - 1050 nm by the Near Infrared 
Transmission (NIT) principle. Each sample contains finely chopped pure 
meat with different moisture, fat and protein contents.
If results from these data are used in a publication we want you to 
mention the instrument and company name (Tecator) in the publication. 
In addition, please send a preprint of your article to:  Karin Thente, Tecator AB, Box 70, S-263 21 Hoganas, Sweden.
</blockquote>
<p>
One reference for these data is Borggaard and Thodberg (1992).
</p><p>
Using <tt><!--rinline 'data(tecator)' --></tt> loads a 215 x 100 matrix of absorbance spectra and a 215 x 3 matrix of outcomes.
</p>
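<p>
A quick check of the two matrices might look like the sketch below. The object names <code>absorp</code> and <code>endpoints</code> are not given in the text above; they are taken here as the names documented in the package help page for <code>tecator</code>.
</p>
<!--begin.rcode tecatorCheck
library(caret)
data(tecator)
## Spectra: one row per sample, one column per channel
## (absorp is the assumed object name, see the note above)
dim(absorp)
## Outcomes: one row per sample, three columns
## (endpoints is the assumed object name, see the note above)
dim(endpoints)
head(endpoints)
    end.rcode-->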

<h2>Fatty Acid Composition Data</h2>
<p>
<a href="http://dx.doi.org/10.1016/j.chemolab.2004.04.011">Brodnjak-Voncina et al. (2005)</a> describe a set of data where seven fatty acid compositions were used to classify commercial oils as one of pumpkin (labeled <code>A</code>), sunflower (<code>B</code>), peanut (<code>C</code>), olive (<code>D</code>), soybean (<code>E</code>), rapeseed (<code>F</code>) or corn (<code>G</code>). There were 96 data points contained in their Table 1 with known results. The breakdown of the classes is given below:
</p><p>
<!--begin.rcode oil1
data(oil)
dim(fattyAcids)
table(oilType)
    end.rcode--> 

</p><p>
As a note, the paper states on page 32 that there are 37 unknown samples while the table on pages 33 and 34 shows that there are 34 unknowns. 
</p>

<h2>German Credit Data</h2>
<p>
These data come from Dr. Hans Hofmann of the University of Hamburg and are stored at
the <a href="http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29">UC Irvine Machine Learning Repository</a>.
</p><p>
These data have two classes for the credit worthiness: good or bad. There are predictors related to attributes such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate as a percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people the applicant is liable to provide maintenance for, telephone, and foreign worker status.
</p><p>
Many of these predictors are discrete and have been expanded into several 0/1 indicator variables.
</p>
<!--begin.rcode GC
library(caret)
data(GermanCredit)
## Show the first 10 columns
str(GermanCredit[, 1:10])
    end.rcode-->
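<p>
As a rough sketch, the class balance and the number of 0/1 indicator columns could be checked as follows. The outcome column name <code>Class</code> is assumed here (it is not stated above), and <code>isIndicator</code> is a hypothetical helper vector created only for this example.
</p>
<!--begin.rcode GCcheck
library(caret)
data(GermanCredit)
## Frequency of the two credit classes (Class is the assumed column name)
table(GermanCredit$Class)
## Flag numeric columns that contain only 0 and 1
isIndicator <- vapply(GermanCredit,
                      function(x) is.numeric(x) && all(x %in% c(0, 1)),
                      logical(1))
sum(isIndicator)
    end.rcode-->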


<h2>Kelley Blue Book resale data for 2005 model year GM cars</h2>
<p>
<a href="http://www.amstat.org/publications/jse/v16n3/datasets.kuiper.html">Kuiper (2008)</a> collected Kelley Blue Book resale data for 804 GM cars (2005 model year). 
</p><p>
<code>cars</code> is a data frame of the suggested retail price (column <code>Price</code>) and various characteristics of each car (columns <code>Mileage</code>, <code>Cylinder</code>, <code>Doors</code>, <code>Cruise</code>, <code>Sound</code>, <code>Leather</code>, <code>Buick</code>, <code>Cadillac</code>, <code>Chevy</code>, <code>Pontiac</code>, <code>Saab</code>, <code>Saturn</code>, <code>convertible</code>, <code>coupe</code>, <code>hatchback</code>, <code>sedan</code> and <code>wagon</code>).
</p>
<!--begin.rcode cars
data(cars)
str(cars)
    end.rcode-->
<p></p>
<h2>Cell Body Segmentation Data</h2>
<p>
<a href="http://www.biomedcentral.com/1471-2105/8/340">Hill, LaPan, Li and Haney (2007)</a> develop models to predict which cells in a high content screen were well segmented. The data consist of 119 imaging measurements on 2019 cells. The original analysis used 1009 cells for training and 1010 as a test set (see the column called <code>Case</code>).
</p><p>
The outcome class is contained in a factor variable called <code>Class</code> with levels <code>PS</code> for poorly segmented and <code>WS</code> for well segmented.
</p>
  
<!--begin.rcode CS
data(segmentationData)
str(segmentationData[,1:10])
    end.rcode-->
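<p>
The original training/test split can be recovered from the <code>Case</code> column. The sketch below assumes that the levels of <code>Case</code> are <code>"Train"</code> and <code>"Test"</code>, which is not stated above; the object names <code>training</code> and <code>testing</code> are hypothetical.
</p>
<!--begin.rcode CSsplit
library(caret)
data(segmentationData)
## Number of cells assigned to each part of the original split
table(segmentationData$Case)
## Subset using the assumed level names "Train" and "Test"
training <- subset(segmentationData, Case == "Train")
testing  <- subset(segmentationData, Case == "Test")
## Class frequencies within the training set
table(training$Class)
    end.rcode-->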
	  <div style="clear: both;">&nbsp;</div>
	</div>
	<!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a></li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a></li>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
	<div style="clear: both;">&nbsp;</div>
      </div>
      <div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
      <!-- end #page -->
    </div>
    <div id="footer-content"></div>
<!--begin.rcode echo = FALSE
knit_hooks$set(inline = hook_inline)    
    end.rcode--> 
 
    <div id="footer">
      <p>Created on <!--rinline I(session) -->.</p>
    </div>
    <!-- end #footer -->
  </body>
</html>
